I’m quite excited to participate in this course. I’ve already used R and R Markdown before, but I know my way of coding is neither efficient nor clean. This is why I expect to learn better ways to use R, R Markdown and the other tools introduced during the next month or so.
My PhD supervisor told me about this course because she thought I could greatly benefit from this.
# I want to print out today's date
date()
## [1] "Tue Apr 13 14:24:36 2021"
I did it. Hooray!
Let’s try this out.
First we read the students2014 data, which is hosted online. After that, we look at what the data consist of.
# Download the data
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = TRUE)
# First look at the data
head(students2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
# Look at the dimensions and structures of the data
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data frame has 166 observations (rows) and 7 variables (columns). Looking at the structure, most variables are numerical: two columns contain integers (age, points) and one is a factor (gender).
Let’s look at a summary of the observations.
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Most of the students are women, and most students are in their 20s. The oldest student is 55 years old. Based on the summary, the attitude, deep, strategic and surface scores are on a scale from 1 to 5. The highest exam score was 33 points, and most scores were above 20 (the median is 23).
Let’s plot the whole dataset to get a general idea of how the variables behave in pairs.
plot(students2014[-1], col = students2014$gender)
This plot can be used to spot possible trends. We are interested in discovering which variables contribute to high or low exam points. For this reason, we’ll look at the last row and the rightmost column, since exam points are paired with the other variables there. For example, in the second column’s (attitude) last row (points) we see some form of correlation. The data points skew from the lower left corner to the upper right corner, roughly following a trend line, suggesting that exam points tend to be high (or low) when attitude scores are high (or low). A similar trend might be seen in the fifth column’s fourth row (stra).
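The visual impression of a linear trend can also be checked numerically with cor(). A minimal sketch with made-up numbers (the column names mirror the dataset, but these are not the real students2014 values):

```r
# Illustrative (made-up) attitude/points pairs — not the real students2014 values
toy <- data.frame(
  attitude = c(2.1, 2.9, 3.5, 3.7, 4.2),
  points   = c(12, 19, 22, 25, 30)
)
# Pearson correlation quantifies the linear trend seen in the scatterplot
cor(toy$attitude, toy$points)  # close to 1 for these made-up values
```

The same call on the real data frame would quantify the trend seen in the pairs plot.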
We can use ggplot2 and GGally packages to generate a more advanced plot. Disregard the “upper” part inside the ggpairs command if you’re not using Linux. Otherwise FYI for Linux users: Ubuntu has some font issues with GGally and that part of the code is a workaround.
library(ggplot2)
library(GGally)
plot <- ggpairs(students2014, mapping = aes(col=gender, alpha = 0.3), upper = list(continuous = "cor_v1_5"), lower = list(combo = wrap("facethist", bins = 20)))
plot
We get some additional information out of this plot. It is important to look at the correlation values (Corr): the higher the absolute value, the stronger the correlation between the variables. We observe that points and attitude have a positive correlation of 0.437, the highest in this dataset. Other correlations with points worth noting are strategic (stra) and surface (surf), both around 0.14 in absolute value; surface scores correlate negatively with points, since the value is negative.
Since attitude, surf and stra are most correlated with points, we’ll fit a linear model on these variables. Our variable of interest is points (y) and our explanatory variables (x1, x2, x3) are attitude, stra and surf.
model <- lm(points ~ attitude + surf + stra, data = students2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Looking at the explanatory variables, we see from the summary that only attitude has a statistically significant relationship (p < 0.001) with points. Surf and stra don’t relate significantly to points, so we’ll refit the model with attitude only, since surf and stra were unnecessarily included.
model <- lm(points ~ attitude, data = students2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The p value for attitude became slightly smaller compared to the previous model. Additionally, we observe that attitude has a positive relationship with points, since the estimate is greater than zero. The intercept is the point where the regression line crosses the y axis: when x (attitude) is zero, y (points) is 11.6372. The intercept is also statistically significant (p < 0.001).
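The fitted line can be used for a quick by-hand prediction. Using the coefficients from the summary above, a student with an attitude score of 3 would be predicted about 22.2 exam points:

```r
# Coefficients read off the model summary above
intercept <- 11.6372
slope     <- 3.5255   # expected points gained per one-unit increase in attitude
# Predicted exam points for a student with attitude score 3
intercept + slope * 3  # 22.2137
```

This is the same arithmetic that predict(model, newdata = ...) would do for us.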
To confirm our model works for our data, we’ll run some diagnostics. Let’s generate the plots first.
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
A linear model assumes that the residuals are normally distributed with constant variance and that they are uncorrelated. From the Residuals vs. Fitted plot we observe that the points scatter quite randomly. If, for example, the points were tightly packed on the left side but widely scattered on the right, we would have a problem.
The Q-Q plot can be used to check the normality assumption: the closer the points fall to the line, the better the assumption holds. Some points at the beginning and end of the line stray, but overall the points follow the dashed line quite nicely.
Residuals vs. Leverage is used to check for influential observations. We can identify outliers if necessary. In our plot there are no worrying outliers, since the data points stay fairly close together, which the x-axis values confirm. If the x-axis values were greater and a single point sat far away from the others, we would have to reconsider our model.
Overall, our diagnostics show that our linear model can be used to fit the data.
Let’s download the data we’re using for this exercise and explore the structure.
student_alc <- read.table("pormath.txt")
dim(student_alc)
## [1] 370 51
str(student_alc)
## 'data.frame': 370 obs. of 51 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
head(student_alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 15 R GT3 T 1 1 at_home other home
## 2 GP F 15 R GT3 T 1 1 other other reputation
## 3 GP F 15 R GT3 T 2 2 at_home other reputation
## 4 GP F 15 R GT3 T 2 4 services health course
## 5 GP F 15 R GT3 T 3 3 services services reputation
## 6 GP F 15 R GT3 T 3 4 services health course
## guardian traveltime studytime schoolsup famsup activities nursery higher
## 1 mother 2 4 yes yes yes yes yes
## 2 mother 1 2 yes yes no no yes
## 3 mother 1 1 yes yes yes yes yes
## 4 mother 1 3 yes yes yes yes yes
## 5 other 2 3 no yes yes yes yes
## 6 mother 1 3 yes yes yes yes yes
## internet romantic famrel freetime goout Dalc Walc health n id.p id.m failures
## 1 yes no 3 1 2 1 1 1 2 1096 2096 0
## 2 yes yes 3 3 4 2 4 5 2 1073 2073 1
## 3 no no 4 3 1 1 1 2 2 1040 2040 0
## 4 yes no 4 3 2 1 1 5 2 1025 2025 0
## 5 yes yes 4 2 1 2 3 3 2 1166 2153 1
## 6 yes no 4 3 2 1 1 5 2 1039 2039 0
## paid absences G1 G2 G3 failures.p paid.p absences.p G1.p G2.p G3.p failures.m
## 1 yes 3 10 12 12 0 yes 4 13 13 13 1
## 2 no 2 10 8 8 0 no 2 13 11 11 2
## 3 no 8 14 13 12 0 no 8 14 13 12 0
## 4 no 2 10 10 9 0 no 2 10 11 10 0
## 5 yes 5 12 12 12 0 yes 2 13 13 13 2
## 6 no 2 12 12 12 0 no 2 11 12 12 0
## paid.m absences.m G1.m G2.m G3.m alc_use high_use cid
## 1 yes 2 7 10 10 1.0 FALSE 3001
## 2 no 2 8 6 5 3.0 TRUE 3002
## 3 yes 8 14 13 13 1.0 FALSE 3003
## 4 yes 2 10 9 8 1.0 FALSE 3004
## 5 yes 8 10 10 10 2.5 TRUE 3005
## 6 yes 2 12 12 11 1.0 FALSE 3006
The dataset contains student achievement data from secondary education at two Portuguese schools. We have 370 students (rows, or observations) and 51 attributes (columns, or variables). These variables include sex, grades (G1, G2, G3), age, school, and information about the students’ social and family background. Two distinct school subjects were of interest: math and Portuguese language.
More details about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this exercise we want to explore the relationship between alcohol consumption and academic performance. The dataset was modified to include alc_use and high_use: alc_use is the average alcohol consumption during the week, and high_use is TRUE when alc_use is greater than 2. We are only including the G3 grades, since this column contains the final grades. Note that the data were wrangled with Reijo Sund’s code, so the dataset differs from the DataCamp set.
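Assuming the standard construction used in the course’s data-wrangling instructions (alc_use as the mean of the weekday and weekend columns Dalc and Walc), the derived variables can be sketched in base R on a few toy rows:

```r
# Toy rows with weekday (Dalc) and weekend (Walc) consumption on a 1-5 scale
toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average weekly consumption
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2
toy
```

The real wrangling script does the same thing across the joined math and Portuguese data.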
Let’s look at high_use, absences, study time, failures and if there are differences between men and women. I hypothesise high consumption results in lower grades, more absences, less study time and more failures. I also hypothesise there are differences between the sexes.
First we’ll cross-tabulate. I excluded absences from cross-tabulation since the results were hard to interpret.
library(tidyr); library(dplyr); library(ggplot2)
student_alc %>% group_by(failures, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 8 x 4
## # Groups: failures [4]
## failures high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 0 FALSE 238 12.2
## 2 0 TRUE 87 11.5
## 3 1 FALSE 12 8.08
## 4 1 TRUE 12 9.25
## 5 2 FALSE 8 7
## 6 2 TRUE 9 8.33
## 7 3 FALSE 1 10
## 8 3 TRUE 3 7.33
student_alc %>% group_by(studytime, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 8 x 4
## # Groups: studytime [4]
## studytime high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 56 11
## 2 1 TRUE 42 10.4
## 3 2 FALSE 128 11.8
## 4 2 TRUE 57 10.8
## 5 3 FALSE 52 12.4
## 6 3 TRUE 8 13
## 7 4 FALSE 23 12.3
## 8 4 TRUE 4 12.5
student_alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
We see that 0 failures combined with low alcohol consumption (high_use FALSE) is associated with the highest mean grade, while 0 failures with high consumption has a somewhat lower mean grade. The relationship is not as straightforward with more failures. Most students’ study time falls into category 2, where low alcohol consumption is associated with better grades, but we also have cases of high consumption and higher study time (3-4) with slightly higher grades. Finally, men with high alcohol consumption seem to have lower grades.
Now for some plots, in which we separate by sex.
ggplot(student_alc, aes(x = high_use, y = G3, col = sex))+ geom_boxplot() + ggtitle("Student grades by alcohol consumption and sex")
ggplot(student_alc, aes(x = high_use, y = absences, col = sex))+ geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
ggplot(student_alc, aes(x=studytime, fill = sex)) + geom_bar() + ggtitle("Study time by alcohol consumption and sex") + facet_wrap(~high_use)
ggplot(student_alc, aes(x=failures, fill = sex)) + geom_bar() + ggtitle("Failures by alcohol consumption and sex") + facet_wrap(~high_use)
From these plots we see that men’s grades are slightly lower when alcohol consumption is high. Absences seem to be slightly more common when alcohol consumption is high with both sexes. The relationship between consumption and study time is again not so straightforward. Failures on the other hand are more common in men when alcohol consumption is high. However, it’s good to bear in mind that the total number of high consumers is lower compared to low consumers of alcohol. My initial hypothesis was not too far off, but I’m surprised how different the results are between men and women.
We’ll run a generalised linear model with binomial distribution. We’ll explore the relationship of high alcohol use to the previously discussed variables failures, absences, sex, study time and final year grades (G3).
# Generalised linear model
model <- glm(high_use ~ failures + absences + sex + studytime + G3, data = student_alc, family = "binomial")
# Summary of the model
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + studytime +
## G3, family = "binomial", data = student_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0917 -0.8297 -0.5871 1.0191 2.1975
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.72528 0.61497 -1.179 0.238246
## failures 0.43511 0.22355 1.946 0.051610 .
## absences 0.08514 0.02320 3.670 0.000242 ***
## sexM 0.86638 0.25849 3.352 0.000803 ***
## studytime -0.32357 0.16400 -1.973 0.048505 *
## G3 -0.03905 0.04017 -0.972 0.330952
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 401.52 on 364 degrees of freedom
## AIC: 413.52
##
## Number of Fisher Scoring iterations: 4
We see that high alcohol use is statistically significantly associated with absences and with men. We see men (“sexM”) in the coefficients because the model used women (“sexF”) as a baseline, so compared to the baseline, male students are significantly associated with high alcohol use. Study time has a weaker relationship with high alcohol use, failures even more so, but the relationship is there. However, grades have no statistical significance in the model, so we’ll rerun the model without grades.
model <- glm(high_use ~ failures + absences + sex + studytime, data = student_alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + studytime,
## family = "binomial", data = student_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0891 -0.8278 -0.5908 1.0193 2.1343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.15122 0.43431 -2.651 0.008033 **
## failures 0.50928 0.21127 2.411 0.015928 *
## absences 0.08650 0.02322 3.725 0.000195 ***
## sexM 0.84863 0.25759 3.295 0.000986 ***
## studytime -0.33939 0.16338 -2.077 0.037771 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 402.47 on 365 degrees of freedom
## AIC: 412.47
##
## Number of Fisher Scoring iterations: 4
We see that the statistical significance of all four variables increased. Also, the intercept became statistically significant, which makes the model more suitable for computing predictions in later steps. Let’s calculate odds ratios and their confidence intervals.
# Odds ratios (OR)
OR <- coef(model) %>% exp
# Confidence intervals (CI)
CI <- confint(model) %>% exp
# Print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3162522 0.1336630 0.7367549
## failures 1.6640983 1.1072255 2.5501879
## absences 1.0903516 1.0441945 1.1440122
## sexM 2.3364412 1.4162568 3.8963237
## studytime 0.7122080 0.5124273 0.9741384
Odds ratios greater than 1 indicate that the explanatory variable has a positive association with the dependent variable (high_use). Therefore, male sex, absences and failures have a positive association, while study time has a negative one. Looking at the confidence intervals, no interval contains the value 1, which corresponds to a p value below 0.05, so we can reject the null hypothesis (that the variable has no relationship with high alcohol use) at the 95 % confidence level.
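The odds ratios above are just the exponentiated model coefficients. For example, exponentiating the sexM estimate 0.84863 from the model summary recovers the 2.34 in the OR table:

```r
# Log-odds coefficient for sexM from the model summary above
b_sexM <- 0.84863
# Exponentiating a log-odds coefficient gives an odds ratio
exp(b_sexM)  # about 2.336: men's odds of high use are roughly 2.3 times women's
```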
Based on these results, the hypothesis that high alcohol consumption results in lower grades, more absences, less study time and more failures is false when it comes to grades and study time. However, high consumption has a positive association with more absences and more failures. I also hypothesised there would be a difference between men and women, and this proved correct. We’ll run the GLM model one more time without study time, in order to increase the statistical significance and predictive power of our model.
model <- glm(high_use ~ failures + absences + sex, data = student_alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = student_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1550 -0.8430 -0.5889 1.0328 2.0374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.94150 0.23129 -8.394 < 2e-16 ***
## failures 0.59759 0.20698 2.887 0.00389 **
## absences 0.09245 0.02323 3.979 6.91e-05 ***
## sexM 0.99731 0.24725 4.034 5.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 406.99 on 366 degrees of freedom
## AIC: 414.99
##
## Number of Fisher Scoring iterations: 4
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1434892 0.08940913 0.2219242
## failures 1.8177381 1.21997630 2.7642305
## absences 1.0968598 1.05041937 1.1508026
## sexM 2.7109881 1.67967829 4.4367282
We see that the statistical significance of failures increased and the odds ratios of failures and sexM increased.
Let’s compute some predictions on our data.
# Predict the probability of high_use
probabilities <- predict(model, type = "response")
# Add the predicted probabilities to student_alc
student_alc$probability <- probabilities
# Use the probabilities to make a prediction of high_use
student_alc <- mutate(student_alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions
table(high_use = student_alc$high_use, prediction = student_alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 78 33
# Tabulate the target variable versus the predictions
table(high_use = student_alc$high_use, prediction = student_alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.21081081 0.08918919 0.30000000
## Sum 0.89189189 0.10810811 1.00000000
Our 2x2 table is a confusion matrix, and it shows how well the predictions worked. The first table shows raw counts, the second the same information as proportions of all students. From the first table we see that when high_use is FALSE, the model predicted FALSE 252 times (correct) and TRUE 7 times (incorrect). However, the model was not good at catching actual high_use: it predicted TRUE only 33 times against 78 incorrect FALSEs. The proportion table tells the same story: correctly predicted FALSE cases make up about 68 % of the data, while correctly predicted TRUE cases make up only about 9 %.
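From the counts in the confusion matrix we can derive the standard summary measures directly (the numbers below are the ones printed above):

```r
# Confusion-matrix counts from the table above
TN <- 252; FP <- 7    # actual FALSE: predicted FALSE / predicted TRUE
FN <- 78;  TP <- 33   # actual TRUE:  predicted FALSE / predicted TRUE
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # overall proportion correct
sensitivity <- TP / (TP + FN)                   # how well actual high_use is caught
specificity <- TN / (TN + FP)                   # how well actual low use is caught
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.770, sensitivity 0.297, specificity 0.973
```

The low sensitivity quantifies the observation that the model struggles to catch actual high users.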
Let’s perform a 10-fold cross-validation.
# Define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Compute the average number of wrong predictions in the (training) data
loss_func(student_alc$high_use, prob = student_alc$probability)
## [1] 0.2297297
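To see how the loss function behaves, here is a tiny made-up example: with true classes (1, 0, 1) and predicted probabilities (0.9, 0.2, 0.3), only the last prediction misses by more than 0.5, so the error is 1/3.

```r
# Same loss function as above, applied to a toy example
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE where the prediction is off by > 0.5
  mean(n_wrong)                       # proportion of wrong predictions
}
# Only the third prediction (true 1, predicted 0.3) counts as wrong
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.3))  # 1/3
```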
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = student_alc, cost = loss_func, glmfit = model, K = 10)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2324324
The lower the average number of wrong predictions, the better. The training data had an average error of about 0.230, and the cross-validation gave an average error of about 0.232. This cross-validated error is slightly better than DataCamp’s 0.26. However, this may be due to different processing of the data, since the data wrangling in DataCamp was not perfect and I used the corrected version by Reijo Sund.
Let’s load the data for this exercise.
library(MASS)
data("Boston")
The Boston dataset is titled “Housing Values in Suburbs of Boston” so the dataset includes information on variables which influence these values.
library(dplyr)
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
The dataset has 506 observations and 14 columns. These include, for example, per capita crime rate by town (crim), nitrogen oxides concentration (nox), average number of rooms per dwelling (rm) and median value of owner-occupied homes (medv). Let’s look at some data summaries.
library(ggplot2);library(GGally)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
ggpairs(Boston, upper = list(continuous = "cor_v1_5"))
There’s a lot going on in these summaries due to the number of variables. Some variables have strongly skewed distributions: for example, crime rates are mostly on the low side, and many areas have a high proportion of black residents. The average number of rooms per dwelling (rm) is the only variable that looks normally distributed. We also have the binary Charles River dummy variable (chas), while the index of accessibility to radial highways (rad) is discrete with a large cluster at its maximum. Some linear relationships can be seen, e.g. between nox and age, as well as zn and dis.
We’ll scale the data by subtracting the column mean from each column and dividing the difference by the standard deviation. This is simple to do with the scale function.
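As a quick sanity check, scale() does exactly the by-hand formula (x − mean(x)) / sd(x); a toy vector confirms the equivalence:

```r
x <- c(2, 4, 6, 8)
by_hand    <- (x - mean(x)) / sd(x)   # subtract mean, divide by standard deviation
with_scale <- as.numeric(scale(x))    # scale() returns a matrix; drop attributes
all.equal(by_hand, with_scale)        # TRUE
```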
Boston_scaled <- scale(Boston)
# New summaries of the data after converting to data frame
Boston_scaled <- as.data.frame(Boston_scaled)
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
ggpairs(Boston_scaled, upper = list(continuous = "cor_v1_5"))
The shapes of the distributions didn’t change, but scaling standardised each variable to mean 0 and standard deviation 1. Next we’ll create a categorical variable of the crime rate.
# Creating a quantile vector
quantiles <- quantile(Boston_scaled$crim)
# Creating the categorical variable
crime <- cut(Boston_scaled$crim, breaks = quantiles, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Looks good. We’ll add this new crime variable to the dataset and remove the old one.
# Remove crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
# Add crime
Boston_scaled <- data.frame(Boston_scaled, crime)
Next we’ll create a test and a training set. We’ll randomly choose 80 % of the data for the training set, so that the remaining 20 % fall into the test set.
# Randomly choose 80% of the scaled Boston rows
ind <- sample(nrow(Boston_scaled), size = nrow(Boston_scaled) * 0.8)
# Create a training set
train <- Boston_scaled[ind,]
# Create a test set
test <- Boston_scaled[-ind,]
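Note that sample() draws randomly, so the split (and every result downstream of it) changes each time the document is knitted. A small sketch of making the split reproducible with set.seed() — the seed value 2021 here is arbitrary:

```r
set.seed(2021)                      # arbitrary seed; any fixed value works
n   <- 506                          # number of rows in the Boston data
ind <- sample(n, size = round(n * 0.8))
length(ind)                         # 405 training rows
anyDuplicated(ind)                  # 0: no row is picked twice
```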
We’ll fit a linear discriminant analysis on the training set. With four classes, LDA produces three discriminants; for simplicity, we’ll plot only the first two.
lda_model <- lda(crime ~ ., data = train)
lda_model
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2475248 0.2301980 0.2648515
##
## Group means:
## zn indus chas nox rm age
## low 1.0557183 -0.9684478 -0.08304540 -0.9195942 0.4425080 -0.9476028
## med_low -0.1395066 -0.2931305 0.04263895 -0.5405360 -0.1238290 -0.3024954
## med_high -0.3655244 0.1734128 0.19334945 0.3830847 0.1909604 0.4040860
## high -0.4872402 1.0170108 -0.01476176 1.0749031 -0.3743531 0.8079826
## dis rad tax ptratio black lstat
## low 0.9399577 -0.6881287 -0.7214038 -0.47556488 0.37865374 -0.77565846
## med_low 0.2842639 -0.5615323 -0.4511004 -0.09262755 0.31137642 -0.13053567
## med_high -0.3597911 -0.3644158 -0.2885715 -0.31421797 0.05551374 -0.01672123
## high -0.8576114 1.6392096 1.5148289 0.78203563 -0.82424628 0.80825471
## medv
## low 0.56235107
## med_low 0.01154643
## med_high 0.20547493
## high -0.66082389
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09848190 0.73643587 -0.80578885
## indus 0.03976686 -0.41778643 0.05123194
## chas -0.07980895 -0.01857315 0.13726551
## nox 0.48075805 -0.55228589 -1.44297535
## rm -0.11866804 -0.13368668 -0.22661954
## age 0.28666163 -0.36557583 -0.16859850
## dis 0.02444929 -0.26349594 -0.10688395
## rad 3.06738830 0.89081087 -0.19503976
## tax -0.10713901 0.09315526 0.82611831
## ptratio 0.11439060 0.04937801 -0.22607945
## black -0.13857858 -0.01349335 0.08509017
## lstat 0.26105874 -0.20735010 0.32649717
## medv 0.23804953 -0.31290314 -0.20776933
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9453 0.0411 0.0136
plot(lda_model, dimen = 2, col = as.numeric(train$crime), pch = as.numeric(train$crime))
We see that most of the high crime rate points separate from the rest. LD1 explains 94 % of the between-group variance.
Next we’ll use the fitted LDA model to predict the crime classes of the test set. Before that, the correct classes will be saved and the crime variable removed from the test data.
# Save the correct classes from test data
correct_classes <- test$crime
# Remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Predict with LDA
lda_predict <- predict(lda_model, newdata = test)
table(correct = correct_classes, predicted = lda_predict$class)
## predicted
## correct low med_low med_high high
## low 10 10 3 0
## med_low 8 15 3 0
## med_high 0 15 18 0
## high 0 0 0 20
The model is best at predicting high crime rate, medium high being the second best, medium low third, and low being the most difficult to predict.
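The cross-tabulation can also be condensed into a single overall accuracy figure; a small sketch, assuming the objects above are still in the workspace:

```r
conf <- table(correct = correct_classes, predicted = lda_predict$class)
# Proportion of test observations on the diagonal, i.e. correctly classified
sum(diag(conf)) / sum(conf)
```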
Next we’ll calculate distances between the observations. The dataset needs to be reloaded, after which we’ll scale the data and then calculate the distances.
data("Boston")
Boston_rescaled <- scale(Boston)
# Euclidean distances
euc_Boston <- dist(Boston_rescaled)
# Manhattan distances
man_Boston <- dist(Boston_rescaled, method = "manhattan")
summary(euc_Boston)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
summary(man_Boston)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Manhattan distances are larger than the corresponding Euclidean distances, since they sum the absolute coordinate differences instead of taking the square root of the summed squares.
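The difference between the two metrics is easy to see with two toy points: Euclidean distance is the straight line between them, Manhattan distance the sum of absolute coordinate differences.

```r
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # Manhattan: |3| + |4| = 7
```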
Next we’ll run K-means clustering and try to determine the optimal number of centers for our data.
# Define a maximum and calculate the total within sum of squares
max <- 10
Twcss <- sapply(1:max, function(k){kmeans(Boston_rescaled, k)$tot.withinss})
# Visualise
qplot(x = 1:max, y = Twcss, geom = 'line')
There’s a radical drop in the total within-cluster sum of squares when moving from one cluster to two. We’ll use two as the optimal number of clusters.
# K-means clustering
km <- kmeans(Boston_rescaled, centers = 2)
# Plot the Boston dataset with clusters
pairs(Boston_rescaled, col = km$cluster)
Based on this, the two-cluster optimum didn’t come out of nowhere. The two clusters fall into their own groups for many variable pairs, such as crim, indus, nox, black, lstat and dis. One notable exception is the variable chas (Charles River dummy variable), where the points are mixed when paired with e.g. age and pupil-teacher ratio (ptratio). We can’t, however, infer what exactly creates the dissimilarity. That’s a whole other exercise.
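The `kmeans` result also stores the cluster sizes and the cluster centres, which help characterise the two groups (a sketch assuming the `km` object from above):

```r
km$size     # number of observations in each cluster
km$centers  # cluster centres for each variable, in standardised units
```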
library(dplyr);library(ggplot2);library(GGally);library(corrplot)
summary(human)
## edu_FM labour_FM edu_exp life_exp
## Min. :0.1717 Min. :13.50 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:44.45 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :53.50 Median :13.50 Median :74.20
## Mean :0.8529 Mean :52.52 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:61.90 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :88.10 Max. :20.20 Max. :83.50
## GNI mat_mort_ratio adol_birthrate rep_parliament
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(human, upper = list(continuous = "cor_v1_5"))
cor(human) %>% corrplot
Our dataset includes information about different countries’ human development indicators. Most of the variables are numerical, with a few integers among the crowd. Looking at the graphical distributions, we see that several variables, such as GNI, maternal mortality ratio and adolescent birthrate, are strongly right-skewed.
If we look at the correlation plot, we see that, for example, maternal mortality ratio is positively correlated with adolescent birthrate but negatively correlated with expected years of education, GNI and life expectancy. Parliamentary representation of women is slightly positively correlated with the proportion of women in the labour force, life expectancy and expected years of education. GNI is positively correlated with women’s education, expected years of education and life expectancy but negatively correlated with maternal mortality ratio and adolescent birthrate.
We’ll draw a PCA plot with these unstandardised data.
pca <- prcomp(human)
biplot(pca, choices = 1:2, cex = c(0.6, 0.8))
It’s basically a mess. GNI is the only variable with a visible arrow, and it points left along the first principal component. Let’s take a look at the components themselves.
summary(pca)$importance
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 18544.1640 185.6147 25.20641 15.37613 10.6753 3.766125
## Proportion of Variance 0.9999 0.0001 0.00000 0.00000 0.0000 0.000000
## Cumulative Proportion 0.9999 1.0000 1.00000 1.00000 1.0000 1.000000
## PC7 PC8
## Standard deviation 1.545704 0.1733653
## Proportion of Variance 0.000000 0.0000000
## Cumulative Proportion 1.000000 1.0000000
We see that PC1’s proportion of variance is basically 100 %. That doesn’t sound realistic, so we’ll standardise the data before making a new PCA plot.
human_scaled <- scale(human)
pca_human <- prcomp(human_scaled)
s1 <- summary(pca)
s2 <- summary(pca_human)
pca_pr1 <- round(100*s1$importance[2, ], digits = 1)
pca_pr2 <- round(100*s2$importance[2, ], digits = 1)
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
biplot(pca, choices = 1:2, cex = c(0.6, 0.8), xlab = pc_lab1[1], ylab = pc_lab1[2])
PCA plot of human data (no standardisation). As previously mentioned, PC1 accounts for essentially 100 % of the variance between observations. Gross national income (GNI) is the only variable with a visible arrow, pointing along the first principal component, which suggests that PC1 is dominated by GNI alone. Basically this plot is inconclusive.
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.7), xlab = pc_lab2[1], ylab = pc_lab2[2])
PCA plot of human data (standardised). We no longer have a single principal component causing all the variance: PC1 accounts for 54.5 % and PC2 for 15.5 % of the variance. The variables go in groups, with parliamentary representation of women and the proportion of women in the labour force pointing up, maternal mortality ratio and adolescent birthrate pointing to the right, and GNI, expected years of education, life expectancy and the proportion of women in secondary education pointing to the left. We also see what drives the countries to different sides of the plot. For example, maternal mortality ratio and adolescent birthrate are important variables for Gambia, the proportion of women in the labour force for Zimbabwe, and parliamentary representation of women for Iceland.
As we see, the two plots are very different. PCA is sensitive to the relative scaling of the original variables: larger variance is interpreted as greater importance. The original variables have very different variances, which we can see here:
var(human)
## edu_FM labour_FM edu_exp life_exp
## edu_FM 0.0583897 -4.222245e-01 0.4071587 1.159753
## labour_FM -0.4222245 2.574366e+02 -6.5651998 -37.076528
## edu_exp 0.4071587 -6.565200e+00 8.0670239 18.682196
## life_exp 1.1597535 -3.707653e+01 18.6821956 69.423283
## GNI 1928.1655568 -3.137263e+04 32883.4502723 96824.966255
## mat_mort_ratio -33.8243423 1.251877e+03 -442.5512317 -1512.597377
## adol_birthrate -5.2594013 1.947709e+02 -82.1542388 -249.778424
## rep_parliament 0.2182832 3.733809e+01 6.7240452 16.280088
## GNI mat_mort_ratio adol_birthrate rep_parliament
## edu_FM 1928.166 -3.382434e+01 -5.259401e+00 0.2182832
## labour_FM -31372.631 1.251877e+03 1.947709e+02 37.3380930
## edu_exp 32883.450 -4.425512e+02 -8.215424e+01 6.7240452
## life_exp 96824.966 -1.512597e+03 -2.497784e+02 16.2800884
## GNI 343874461.958 -1.944698e+06 -4.243095e+05 19003.7563259
## mat_mort_ratio -1944698.063 4.485483e+04 6.605745e+03 -217.6061751
## adol_birthrate -424309.466 6.605745e+03 1.690201e+03 -33.4746473
## rep_parliament 19003.756 -2.176062e+02 -3.347465e+01 131.9683058
var(human_scaled)
## edu_FM labour_FM edu_exp life_exp GNI
## edu_FM 1.00000000 -0.1089031 0.5932516 0.5760299 0.43030485
## labour_FM -0.10890308 1.0000000 -0.1440642 -0.2773393 -0.10544253
## edu_exp 0.59325156 -0.1440642 1.0000000 0.7894392 0.62433940
## life_exp 0.57602985 -0.2773393 0.7894392 1.0000000 0.62666411
## GNI 0.43030485 -0.1054425 0.6243394 0.6266641 1.00000000
## mat_mort_ratio -0.66093177 0.3684019 -0.7357026 -0.8571684 -0.49516234
## adol_birthrate -0.52941841 0.2952703 -0.7035649 -0.7291774 -0.55656208
## rep_parliament 0.07863528 0.2025733 0.2060816 0.1700863 0.08920818
## mat_mort_ratio adol_birthrate rep_parliament
## edu_FM -0.6609318 -0.5294184 0.07863528
## labour_FM 0.3684019 0.2952703 0.20257328
## edu_exp -0.7357026 -0.7035649 0.20608156
## life_exp -0.8571684 -0.7291774 0.17008631
## GNI -0.4951623 -0.5565621 0.08920818
## mat_mort_ratio 1.0000000 0.7586615 -0.08944000
## adol_birthrate 0.7586615 1.0000000 -0.07087810
## rep_parliament -0.0894400 -0.0708781 1.00000000
Looking at these tables, it’s no wonder GNI was the only variable with a distinct arrow in the first PCA plot, since its variance is huge. The variances are much more even after standardisation.

If we compare the standardised PCA plot to the correlation plot shown earlier, the results make sense. For example, we already established that maternal mortality ratio and adolescent birthrate are correlated. These two variables, together with life expectancy, expected years of education, the proportion of women in secondary education and GNI, mostly load on PC1, while the proportion of women in the labour force and parliamentary representation mostly load on PC2. Since PC1 accounts for 54.5 % of the total variance, we can conclude that this education, health and income dimension creates the largest differences between the countries. GNI, life expectancy, expected years of education and the proportion of women in secondary education act as an “opposite force” to maternal mortality ratio and adolescent birthrate: more education goes with higher life expectancy, and if women gain access to secondary education, they are less likely to stay home and give birth at a young age. Women are also less likely to die in childbirth if they have access to secondary education. Education correlating with GNI makes sense as well, since a more educated population has access to more jobs, which brings more money into the country.
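As a side note, the manual `scale()` step can be folded into `prcomp()` itself, which should give the same standardised PCA:

```r
# Equivalent to prcomp(scale(human)): centre and scale the variables first
pca_human_alt <- prcomp(human, scale. = TRUE)
```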
We’ll move on to the tea dataset from the FactoMineR package. Let’s explore the data.
library(FactoMineR);library(tidyr)
data(tea)
glimpse(tea)
## Rows: 300
## Columns: 36
## $ breakfast <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, b…
## $ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N…
## $ evening <fct> Not.evening, Not.evening, evening, Not.evening, eveni…
## $ lunch <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch…
## $ dinner <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, d…
## $ always <fct> Not.always, Not.always, Not.always, Not.always, alway…
## $ home <fct> home, home, home, home, home, home, home, home, home,…
## $ work <fct> Not.work, Not.work, work, Not.work, Not.work, Not.wor…
## $ tearoom <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, N…
## $ friends <fct> Not.friends, Not.friends, friends, Not.friends, Not.f…
## $ resto <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, No…
## $ pub <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,…
## $ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G…
## $ How <fct> alone, milk, alone, alone, alone, alone, alone, milk,…
## $ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,…
## $ how <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,…
## $ where <fct> chain store, chain store, chain store, chain store, c…
## $ price <fct> p_unknown, p_variable, p_variable, p_variable, p_vari…
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 6…
## $ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F,…
## $ SPC <fct> middle, middle, other worker, student, employee, stud…
## $ Sport <fct> sportsman, sportsman, sportsman, Not.sportsman, sport…
## $ age_Q <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-4…
## $ frequency <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/we…
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-ex…
## $ spirituality <fct> Not.spirituality, Not.spirituality, Not.spirituality,…
## $ healthy <fct> healthy, healthy, healthy, healthy, Not.healthy, heal…
## $ diuretic <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diure…
## $ friendliness <fct> Not.friendliness, Not.friendliness, friendliness, Not…
## $ iron.absorption <fct> Not.iron absorption, Not.iron absorption, Not.iron ab…
## $ feminine <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminin…
## $ sophisticated <fct> Not.sophisticated, Not.sophisticated, Not.sophisticat…
## $ slimming <fct> No.slimming, No.slimming, No.slimming, No.slimming, N…
## $ exciting <fct> No.exciting, exciting, No.exciting, No.exciting, No.e…
## $ relaxing <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxin…
## $ effect.on.health <fct> No.effect on health, No.effect on health, No.effect o…
#First 18 variables
gather(tea[,1:18]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))
# Last 18 variables
gather(tea[,19:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))
Most of the variables are factors, with one integer variable (age). The dataset was constructed by asking 300 individuals (rows) how they drink tea and how they perceive tea products, along with some personal questions (altogether 36 questions, columns).
We’ll analyse the data with MCA. First, we’ll subset the data a bit to make the analysis easier.
# Column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")
tea_time <- dplyr::select(tea, one_of(keep_columns))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.313 0.266 0.249 0.205 0.184 0.175 0.168
## % of var. 13.434 11.404 10.689 8.791 7.883 7.511 7.218
## Cumulative % of var. 13.434 24.838 35.527 44.318 52.201 59.712 66.930
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.147 0.141 0.135 0.119 0.101 0.072 0.056
## % of var. 6.302 6.051 5.787 5.086 4.348 3.094 2.403
## Cumulative % of var. 73.231 79.282 85.069 90.155 94.503 97.597 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.068 0.005 0.002 | 0.048 0.003 0.001 | -0.471
## 2 | 0.142 0.021 0.009 | 0.089 0.010 0.004 | -1.069
## 3 | -0.199 0.042 0.033 | -0.254 0.081 0.053 | -0.600
## 4 | -0.798 0.678 0.665 | -0.240 0.072 0.060 | 0.064
## 5 | -0.199 0.042 0.033 | -0.254 0.081 0.053 | -0.600
## 6 | -0.582 0.360 0.361 | -0.167 0.035 0.030 | -0.235
## 7 | -0.190 0.039 0.022 | -0.036 0.002 0.001 | -0.411
## 8 | 0.150 0.024 0.009 | 0.307 0.118 0.036 | -0.880
## 9 | 0.268 0.076 0.026 | 0.900 1.016 0.290 | 0.175
## 10 | 0.605 0.390 0.137 | 0.870 0.948 0.282 | -0.073
## ctr cos2
## 1 0.296 0.106 |
## 2 1.527 0.527 |
## 3 0.481 0.297 |
## 4 0.005 0.004 |
## 5 0.481 0.297 |
## 6 0.074 0.059 |
## 7 0.226 0.103 |
## 8 1.035 0.298 |
## 9 0.041 0.011 |
## 10 0.007 0.002 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## black | 0.750 7.377 0.184 7.421 | 0.467 3.365 0.071 4.618
## Earl Grey | -0.389 5.179 0.273 -9.036 | -0.016 0.010 0.000 -0.367
## green | 0.594 2.063 0.044 3.610 | -0.954 6.272 0.113 -5.800
## alone | -0.096 0.316 0.017 -2.253 | -0.262 2.803 0.128 -6.183
## lemon | 0.483 1.366 0.029 2.938 | 0.202 0.282 0.005 1.229
## milk | -0.091 0.093 0.002 -0.813 | 0.315 1.307 0.026 2.811
## other | 0.938 1.404 0.027 2.853 | 2.736 14.071 0.232 8.321
## tea bag | -0.460 6.376 0.277 -9.096 | -0.154 0.842 0.031 -3.046
## tea bag+unpackaged | 0.174 0.504 0.014 2.031 | 0.719 10.150 0.236 8.400
## unpackaged | 1.718 18.837 0.403 10.972 | -1.150 9.948 0.180 -7.346
## Dim.3 ctr cos2 v.test
## black | -0.740 9.026 0.179 -7.322 |
## Earl Grey | 0.334 4.788 0.201 7.751 |
## green | -0.293 0.629 0.011 -1.778 |
## alone | -0.047 0.098 0.004 -1.118 |
## lemon | 1.264 11.746 0.197 7.685 |
## milk | -0.379 2.011 0.038 -3.375 |
## other | -0.957 1.836 0.028 -2.910 |
## tea bag | -0.436 7.208 0.249 -8.627 |
## tea bag+unpackaged | 0.663 9.193 0.200 7.740 |
## unpackaged | 0.330 0.873 0.015 2.107 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.275 0.154 0.216 |
## How | 0.060 0.295 0.235 |
## how | 0.484 0.334 0.259 |
## sugar | 0.132 0.013 0.200 |
## where | 0.528 0.632 0.233 |
## age_Q | 0.402 0.169 0.354 |
There are altogether 14 dimensions. The first three dimensions each retain 10-13 % of the variance. We only see the first 10 individuals (rows) with their contributions to the dimensions (ctr) and their squared correlations with the dimensions (cos2), and likewise for the first 10 categories. Lastly, we have the categorical variables and their squared correlation (eta2) with each dimension: the closer a value is to 1, the stronger the link between the variable and the dimension. The variable “where” has the strongest link to dimension 1, with “how” a close second and “age_Q” third. “where” also has the strongest relationship with dimension 2, and the variable “How” is more strongly linked to dimension 2 than to dimension 1.
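The eta2 table at the end can also be pulled straight out of the MCA result object, which is handy for picking the variables most tied to each dimension (a sketch assuming the `mca` object from above):

```r
# Squared correlations between the categorical variables and the dimensions
round(mca$var$eta2[, 1:3], 3)
```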
# Plot function for MCA, hide individual observations
plot.MCA(mca, invisible = c("ind"))
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
MCA plot of the subsetted tea data. The closer two categories are, the more similar they are. Buying tea from specialised tea shops is close to consuming unpackaged tea, which makes sense, since tea shops usually sell unpackaged tea; both have a strong relationship with dimension 1, and green tea consumers are probably more likely to buy unpackaged tea from tea shops. Tea consumption also differs between the age groups. For example, 15-24-year-olds are quite close to tea bag, Earl Grey and chain stores. Presumably chain stores mostly sell tea bags (instead of unpackaged tea), Earl Grey is a popular option, and chain-store Earl Grey tea bags are likely cheap, which suits the usually modest income of 15-24-year-olds. Those over 60 are more likely to drink black tea. Drinking tea as it is (= alone) or with sugar lies close to the centre, which indicates that these habits vary enough across all respondents not to be pulled strongly towards any particular dimension. No sugar, milk and lemon are a bit farther from the centre, but still quite close. Consuming tea in some other way is far from the other categories, suggesting a distinct group of respondents.
The RATS data come from a nutrition study on rats. The rats were assigned to three groups with different diets, and their body weight was recorded approximately weekly (week 7 includes two recordings) up until week 9. The researchers wanted to explore whether the growth profiles differ between the groups.
Let’s take a look at the data.
library(dplyr);library(tidyr);library(ggplot2)
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
The data frame has 5 columns and 176 rows. There are 16 rats in total: eight belong to group 1, four to group 2 and four to group 3. Each rat’s ID is repeated because of the repeated weight measurements. From the plot it seems that group 1 rats have had the lowest body weight throughout the study, while groups 2 and 3 show an increase; however, these rats also had more body weight to begin with. Group 2 has one rat whose body weight is high throughout the study, and groups 1 and 3 each have a rat whose body weight is low compared to the others in the group.
Let’s standardise the weights and make a new plot.
# Standardise weight
RATSL <- RATSL %>%
group_by(Group) %>%
mutate(stdweight=scale(Weight) ) %>%
ungroup()
# Look at the new data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdweight <dbl[,1]> <matrix[26 x 1]>
# Plot the data
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight") +
theme(legend.position = "none")
Now the weight increase looks similar in each group and the “outliers” mentioned before are now even more distinct.
Let’s look at the mean weight profiles.
# Extract number of measurement times
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of weight by group and measurement times
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30…
# Plot the mean profiles with standard error
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
geom_point(size=3) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)") +
theme(legend.position = "right")
There does seem to be somewhat more weight increase in groups 2 and 3. Group 2 has a larger standard error than group 3, and group 1 has the lowest.
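Note that the `n` used in the standard error above is the number of measurement times, which is a slightly unusual divisor; a sketch of the more conventional version divides by the number of rats contributing to each mean:

```r
# Standard error using the group size at each time point (n() = rats per group)
RATSS_alt <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n()), .groups = "drop")
```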
Let’s see if there are still outliers after averaging weight.
# Create a summary data by group and ID with mean as the summary variable.
RATSL8S <- RATSL %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight)")
From this plot we see there’s one actual outlier in group 1, the lowest data point. Groups 2 and 3 also seem to have outliers, BUT we only have four(!) rats in each of those groups. That amount is so ridiculously small that it’s dangerous to take anything out. Group 1 has 8 rats, so taking one out would still be feasible… But truth be told, a research setting where one group has twice the number of rats as the others is not good. For the sake of my sanity I’d rather take four rats out of group 1 to make the groups more comparable. One of the excluded rats will be the outlier.
RATSL_exc <- RATSL8S %>%
filter(ID != 1, ID != 2, ID != 3, ID != 4)
glimpse(RATSL_exc)
## Rows: 12
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 269.4545, 274.7273, 274.6364, 265.4545, 440.8182, 452.7273, 454.…
Let’s perform t-tests on this dataset. A t-test compares two groups at a time, so we’ll subset the data accordingly.
# Subsetting for t-tests
test1 <- RATSL_exc %>% filter(Group != 1)
test2 <- RATSL_exc %>% filter(Group != 2)
test3 <- RATSL_exc %>% filter(Group != 3)
# Two-sample t-tests
t.test(mean ~ Group, data = test1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -1.1086, df = 6, p-value = 0.3101
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -131.79027 49.60845
## sample estimates:
## mean in group 2 mean in group 3
## 484.7045 525.7955
t.test(mean ~ Group, data = test2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -22.612, df = 6, p-value = 4.898e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -282.2922 -227.1623
## sample estimates:
## mean in group 1 mean in group 3
## 271.0682 525.7955
t.test(mean ~ Group, data = test3, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -6.0255, df = 6, p-value = 0.0009433
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -300.3927 -126.8800
## sample estimates:
## mean in group 1 mean in group 2
## 271.0682 484.7045
The difference is not significant between groups 2 and 3, but it is significant when comparing group 1 against group 2 or group 3. The confidence interval includes 0 when comparing groups 2 and 3 but not when comparing group 1 against groups 2 and 3, so in the latter cases we can reject the null hypothesis at the 5 % significance level.
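Running three pairwise t-tests inflates the risk of false positives; as a sketch, the same comparisons could be done in one call with a multiple-testing correction:

```r
# All pairwise group comparisons with Bonferroni-adjusted p-values
pairwise.t.test(RATSL_exc$mean, RATSL_exc$Group,
                p.adjust.method = "bonferroni", pool.sd = FALSE)
```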
Before fitting a linear model, we’ll add the baseline weight to the data.
# Add the baseline from the original data as a new variable to the summary data
RATS_filtered <- RATS %>% filter(ID != 1, ID != 2, ID != 3, ID != 4)
RATSL_exc <- RATSL_exc %>%
mutate(baseline = RATS_filtered$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL_exc)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 164212 164212 1045.8116 9.109e-10 ***
## Group 2 700 350 2.2285 0.1701
## Residuals 8 1256 157
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA indicates that, once baseline weight is accounted for, there’s no significant difference between the groups in weight gain.
To summarise: the mean weights differ significantly when comparing groups 2 and 3 to group 1 (t-test), but after adjusting for baseline weight the group effect is negligible (ANOVA).
In this study, 40 men were randomly assigned to two treatment groups and their progress was monitored with the brief psychiatric rating scale (BPRS). The measurements were made from before treatment (week 0) up until week 8. BPRS assesses symptom constructs (e.g. hostility, hallucinations), each rated from one to seven (1 = not present, 7 = extremely severe).
We’ll take a look at the dataset.
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, week0, week…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype=subject)) + facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name= "Weeks") + scale_y_continuous(name = "BPRS points") + theme(legend.position = "top") +
theme(legend.position = "none")
We see the two treatment groups look fairly similar. Treatment group 2 has one outlier subject whose BPRS score is high most of the time.
We’ll fit two models. Both are random intercept and random slope models: the first without an interaction term, the second with one. We’ll see the effects of week and treatment (and their interaction) on the BPRS scores, taking each subject into account as a random effect.
library(lme4)
# Fitting the models
# (note: the random term uses the factor `weeks`, so a separate random effect is
# estimated for each time point; `(week | subject)` would fit a single random slope)
BPRS_model1 <- lmer(bprs ~ week + treatment + (weeks | subject), data = BPRSL)
# With interaction
BPRS_model2 <- lmer(bprs ~ week * treatment + (weeks | subject), data = BPRSL)
# Model summaries
summary(BPRS_model1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week + treatment + (weeks | subject)
## Data: BPRSL
##
## REML criterion at convergence: 2720.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8750 -0.5665 -0.0880 0.5723 3.3855
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 69.49 8.336
## weeksweek1 12.05 3.472 0.26
## weeksweek2 13.39 3.659 -0.22 0.63
## weeksweek3 22.84 4.779 -0.27 0.83 0.89
## weeksweek4 35.24 5.936 -0.59 0.61 0.58 0.85
## weeksweek5 44.38 6.662 -0.72 0.41 0.38 0.69 0.97
## weeksweek6 56.45 7.513 -0.64 0.45 0.32 0.67 0.96 0.99
## weeksweek7 55.05 7.419 -0.44 0.57 0.23 0.64 0.91 0.93
## weeksweek8 56.12 7.491 -0.32 0.65 0.24 0.66 0.89 0.88
## Residual 93.17 9.652
##
##
##
##
##
##
##
##
## 0.97
## 0.93 0.99
##
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 44.0875 1.9989 22.055
## week -2.1522 0.2958 -7.275
## treatment2 0.5722 1.0174 0.562
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.578
## treatment2 -0.254 0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
summary(BPRS_model2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week * treatment + (weeks | subject)
## Data: BPRSL
##
## REML criterion at convergence: 2717.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9990 -0.5893 -0.0968 0.5262 3.5095
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 69.91 8.361
## weeksweek1 12.17 3.488 0.25
## weeksweek2 13.73 3.706 -0.22 0.64
## weeksweek3 23.33 4.830 -0.27 0.83 0.89
## weeksweek4 35.73 5.977 -0.59 0.61 0.58 0.85
## weeksweek5 44.85 6.697 -0.72 0.41 0.39 0.69 0.97
## weeksweek6 56.96 7.547 -0.64 0.46 0.33 0.67 0.96 0.99
## weeksweek7 55.42 7.444 -0.44 0.57 0.23 0.64 0.91 0.93 0.97
## weeksweek8 56.48 7.515 -0.32 0.65 0.24 0.66 0.89 0.88 0.93 0.99
## Residual 92.44 9.614
##
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.4954 2.1453 21.207
## week -2.5082 0.3548 -7.069
## treatment2 -2.2911 1.8687 -1.226
## week:treatment2 0.7158 0.3925 1.824
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.436 0.465
## wek:trtmnt2 0.366 -0.553 -0.840
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
The slope for week is negative in the non-interaction model, suggesting that BPRS scores go down over time. Treatment group 2 seems to have slightly higher BPRS scores than treatment group 1, although the coefficient is small relative to its standard error. In the interaction model, the coefficients for week and treatment2 are negative, suggesting that treatment group 2 starts with lower BPRS scores, but the positive week:treatment2 interaction suggests its scores decline more slowly over the weeks. We were right to add subject as a random effect, since the subject variance is high in both models.
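The boundary (singular) fit warnings in the summaries above deserve a comment. The random-slope term uses the factor `weeks` (hence the weeksweek1–weeksweek8 rows in the random effects), which asks lme4 to estimate a very large covariance matrix from only 20 subjects. A more parsimonious specification, shown here only as a sketch of an alternative rather than a rerun of the analysis, would put the random slope on the numeric `week` variable:

```r
library(lme4)

# Sketch: random intercept and random slope on numeric week for each
# subject; this estimates only two variances and one correlation,
# which may avoid the singular fit
BPRS_model1b <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL)
summary(BPRS_model1b)
```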
# Comparison of the models with ANOVA
anova(BPRS_model1, BPRS_model2)
## Data: BPRSL
## Models:
## BPRS_model1: bprs ~ week + treatment + (weeks | subject)
## BPRS_model2: bprs ~ week * treatment + (weeks | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_model1 49 2822.4 3012.8 -1362.2 2724.4
## BPRS_model2 50 2821.1 3015.4 -1360.5 2721.1 3.3344 1 0.06785 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA comparison, model 2 has a slightly smaller AIC but a slightly larger BIC than model 1, and the likelihood ratio test of model 1 against model 2 gives a chi-squared statistic of 3.33 with 1 degree of freedom (Df). The p-value (0.068) is not statistically significant at the 0.05 level. These results do not clearly indicate which model fits our data better.
Either way, there seem to be no statistically significant differences between the treatment groups.
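Note that `anova()` refits `lmer` models with maximum likelihood (ML) before comparing them, which is why the deviances above (2724.4 and 2721.1) differ from the REML criteria in the model summaries (2720.5 and 2717.2). The same information criteria can be obtained directly; this is only a sketch, not a rerun of the analysis:

```r
# Refit both models with ML explicitly and extract AIC/BIC
BPRS_model1_ml <- update(BPRS_model1, REML = FALSE)
BPRS_model2_ml <- update(BPRS_model2, REML = FALSE)
AIC(BPRS_model1_ml, BPRS_model2_ml)
BIC(BPRS_model1_ml, BPRS_model2_ml)
```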
We’ll create new plots with fitted values from both of these models and compare them to the observed values, even though we can’t say either model is a good fit for the data.
# Create a vector of the fitted values
Fitted1 <- fitted(BPRS_model1)
Fitted2 <- fitted(BPRS_model2)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(fitted1 = Fitted1, fitted2 = Fitted2)
# Draw the plot of BPRSL with the observed BPRS values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = subject)) + facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "BPRS points") +
theme(legend.position = "none")
# Draw the plots of BPRSL with the Fitted values of BPRS (model 1)
ggplot(BPRSL, aes(x = week, y = fitted1, group = subject)) +
geom_line(aes(linetype = subject)) + facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "Fitted BPRS points") +
theme(legend.position = "none")
# Model 2
ggplot(BPRSL, aes(x = week, y = fitted2, group = subject)) +
geom_line(aes(linetype = subject)) + facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "Fitted BPRS points") +
theme(legend.position = "none")
The fitted plots are very similar to each other, so they offer no help in choosing between the models. What we do see is that the fitted values do a passable job of modelling our data. They do not handle outliers well, though: one subject’s fitted BPRS score jumps up for no apparent reason, and the fitted trajectory of the subject with extreme BPRS scores is pulled down compared to the actual values.
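As a quick numerical complement to the plots, we could compare the models’ residual errors directly. This is only a sketch, assuming the fitted1 and fitted2 columns created above:

```r
# Root-mean-square error of each model's fitted values
# (smaller is better; expect the two models to be very close)
rmse1 <- sqrt(mean((BPRSL$bprs - BPRSL$fitted1)^2))
rmse2 <- sqrt(mean((BPRSL$bprs - BPRSL$fitted2)^2))
c(model1 = rmse1, model2 = rmse2)
```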